********************************************************************************************************************
********************************************************************************************************************
********************************************************************************************************************
* Do file for Phil Magness and Michael Makovi, "The Mainstreaming of Marx: Measuring the Effect of the Russian Revolution on Karl Marx’s Influence"
* Code by Michael Makovi
********************************************************************************************************************
********************************************************************************************************************
********************************************************************************************************************
********************************************************************************************************************
clear
version 17.0

/**********************************************************
Change the current directory to the master folder containing this whole project,
and create a global variable "path" ($path) containing the path to this directory.

Each one of our synthetic control regression results comprises many separate files, 
so each separate SCM regression has its own folder. Saving the master path in a 
global variable will facilitate changing directories.

This do file is contained within a folder named "do files", that is in turn
contained in the master folder. So move up one directory and then save the 
name of the current directory. By saving the name of the master folder, we
facilitate changing directories later.
*********************************************************/
clear

* By default, we begin in the directory containing this do file.
* So move up one directory, to the master directory for this paper or project.
cd ..

* Now, c(pwd) contains the name of the current directory. This can be verified by disp c(pwd) or by creturn list
global path = c(pwd)
disp "$path"
disp c(pwd)

* Create a "results" folder if one does not already exist.
clear
cd "$path"
capture mkdir "results"

* Begin logging.
log using "$path\results\Stata output log.log", replace

clear

********************************************************************************************************************
* For quick bug-testing, change the following global from 0 to 1.
* When set to 1, the dataset will be restricted to a limited set of authors,
* and it will change convergence to "technique(dfp)" regardless of what it had previously been set to.
********************************************************************************************************************
global quick_bug_testing = 0

/**********************************************************
Create a global variable specifying the convergence parameter.
Some convergence criteria - esp. nested - will make the regressions 
take far longer, so it is useful to run this do-file with non-nested
for bug-testing. Once it is verified that the entire do-file runs 
without error, the parameter can be changed to nested. By defining a 
global variable here, we only have to change one line in order
to change the convergence parameter for all our regressions at once.

The synthetic control method takes a convergence parameter:
unspecified (non-nested), "nested", and "nested allopt". 

And optionally, it takes parameters from Stata's ml optimizer, such as
"margin(#)", "technique(string)", and "maxiter(#)"

margin(#) indicates a constraint violation tolerance; specifying a small number
results in fewer graceful failures to converge.

The default technique (nr dfp bfgs) is far slower and uses much
more RAM than two alternatives: DFP (Davidon–Fletcher–Powell) 
and BFGS (Broyden–Fletcher–Goldfarb–Shanno). (On synth's default technique,
check the source code via "view c:\ado\plus\s\synth.ado" - you'll see it's 
"technique (nr dfp bfgs)". To find the location of synth.ado, type "which synth".)

Finally, setting maxiter(#) can prevent interminable hangs. When we enlarged our 
dataset by adding the second 20 vol of the Harvard Classics and the German authors, 
we found that a few parallel synth_runner threads would hang, failing to converge 
or even suffer graceful failure to converge, after many, many hours. But we found 
that even as few as 5 to 10 iterations is often enough to get a very good fit, and 
still better than the non-nested fit. And setting a max number of iterations did 
prevent these endless loops. But strangely, each extra iteration seems to increase 
the time required non-linearly. E.g., 15 iterationsfor "Karl Marx" synth takes about 
5 times longer than 10 iterations. So it seems that the extra iterations are greedy 
somehow. Strangely, setting margin(#) did not solve the problem of these endlessly 
repeating iterations. In light of all these, we set maxiter(20).

Note that all of our std p-values are discounted by the pre-treatment RMSPE. So
if any of these options do result in a small number of donors getting bad 
pre-treatment fits, then this should not affect our results too much, because
their post-treatment treatment effects will be discounted by their large 
pre-treatment RMSPE. So e.g., if maxiter(20) causes a unit to get horrible fit,
then that unit should be discounted by a correspondingly large amount. 
**********************************************************/
global convergence = "nested allopt technique(dfp) maxiter(20)"

* In case we change $convergence at some point (e.g. for a robustness test), we also save it as a duplicate backup.
global convergence_backup = "$convergence"

* We define a lower level of convergence, with fewer iterations, for the 
* normalization tests because there are 39 separate synth_runner regressions,
* and they collectively take several days at 15 iterations.
global convergence_normalization = "nested allopt technique(dfp) maxiter(15)"

if ($quick_bug_testing == 1) {
	global convergence = "technique(dfp)"
	global convergence_backup = "technique(dfp)"
	global convergence_normalization = "technique(dfp)"
}

/*********************************************************
For maximum reproducibility of results, 
we do NOT use the Intel Math Kernel Library 
See https://www.stata.com/manuals/m-1lapack.pdf

We set Mata to prefer maximizing speed over minimizing memory
See https://www.stata.com/manuals13/m-3mataset.pdf

We set "lapack_mkl off" and "mata set matafavor speed" 
permanently because the parallel module works by opening
additional instances of Stata, and we need these to have the 
same settings. 

Therefore, we save the savings which existed prior to running this do file, and
restore them at the end. This ensures that the change is not really permanent.
*********************************************************/
global previous_lapack_setting = c(lapack_mkl)
global previous_matafavor_setting = c(matafavor)

set lapack_mkl off, permanently
mata: mata set matafavor speed, permanently

/*******************************************************************************************************
* Install necessary packages if necessary

* synth by Hainmueller, Abadie, and Diamond
ssc install synth, replace all

* Outdated version of synth:
* net from "https://web.stanford.edu/~jhain/Synth"
* net install synth, all replace force

* synth_runner by Quistorff and Galiani
net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace

* parallel by Vega Yon and Quistorff
net install parallel, from(https://raw.github.com/gvegayon/parallel/master/) replace
mata mata mlib index

* github and rcall by Haghish (github is needed to install rcall)
net install github, from("https://haghish.github.io/github/")
github install haghish/rcall, stable
rcall: install.packages("readstata13", repos="http://cran.uk.r-project.org")
rcall_check 
rcall describe
* rcall setpath "if rcall cannot detect your R installation, then set the path here to R.exe"

* Wilson's harmonic mean p-value, implemented in R.
rcall: install.packages("harmonicmeanp", repos="http://cran.uk.r-project.org")

* estout by Ben Jann, in order to use estadd, to add results to e() matrices.
ssc install estout, replace

* matsave is not used, but it was potentially useful for our purposes, so this is kept as a reminder in case we ever need it
* ssc install matsave, replace all
********************************************************************************************************/

* Even if harmonicmeanp is installed, it must still be loaded by R.
rcall: library(harmonicmeanp)

update query 

/**********************************************************
Do some more system setup
**********************************************************/

* Turn off the "more" message when output is long
* Enable timer that will indicate execution time of all commands
* Set a seed so any results involving randomization will be replicable
set more off
set rmsg on
set seed 8675309

* Set the parallel module to use the number of logical processors minus 1, so that there's still some processing capacity left for the operating
* system and background process. But if (number of processors minus 1) is 0 or negative, then we use one processor
parallel numprocessors
local numprocessors_to_use = r(numprocessors)-1 
if (`numprocessors_to_use' <= 0) {
	local numprocessors_to_use = 1
}
parallel setclusters `numprocessors_to_use'

* if Stata/MP, use all of the processors available - the number of licensed cores or the number of logical processors, whichever is smaller
local is_MP = c(MP)
if (`is_MP') {
	local licensed_cores = c(processors_max)
	local logical_cores = c(processors_mach) 
	if (`licensed_cores' <= `logical_cores') {
		set processors `licensed_cores'
	}
	else {
		set processors `logical_cores'
	}
}
local is_MP = ""
local processors_max = ""


clear

/*************************************************************************************************************************************************************
Load data and create different snapshots with different configurations of the data --- using "snapshot"

Each of following snapshots configures the data in a way that is convenient for several regressions, so that each regression only needs to make slight changes to 
the data, rather than redoing all the configurations. E.g., we have a snapshot called "Karl Marx, 1878-1932", which drops "Marx" and keeps the years "1878-1932."
This is used for the vast majority of our regressions. When we drop "Kropotkin" for robustness, we can just load "Karl Marx, 1878-1932" and then drop "Kropotkin,"
without worrying about dropping "Marx" and keeping "1878-1932." 

Each snapshot has its own encoding of string Name ---> long Name_no. This way, if you use a given snapshot, the encoding is always the same for that snapshot. 
So this makes it easier to keep track of which name corresponds to which number. Just load a snapshot and tabulate Name_no_Name.

In each snapshot, we also normalize citation counts, YearofPublication, and YearofTranslationtoEnglish to the range [0,1] to accelerate convergence of numerical estimation.

In Stata 16 (and newer), the new frames utility is one way of achieving this - especially because you refer to frames by name, not number, so you don't have to
worry about loading the wrong frame if you create another one. But the problem with frames is that if you "clear" the data, you clear all the data in that frame.
By contrast, with snapshots, if you clear the data, the snapshots themselves are not cleared. So if you clear data when using snapshots, you can just 
restore the snapshot. But if you clear data when using frames, your data are lost, and you have to recreate the whole frame. So for that reason, we prefer 
snapshots. Therefore, we use snapshots with a trick that still allows me to refer to snapshots by name, not number.

To achieve the best of both worlds - the un-clearability of snapshots with the frame's ability to refer by name and not by number, we do the following:
After we create a snapshot, we call "quietly snapshot list", which saves the total number of snapshots in r(snapshot). Then we create a new global macro with
a descriptive name, and a value equal to r(snapshot). That way, each snapshot number is associated with a global macro with a descriptive name.
Then, we can snapshot restore $global_descriptive_name rather than snapshot restore #. And even if we change the snapshot code, such as by adding another snapshot
in the middle of preexisting snapshots, so that the numbering of snapshots changes, the descriptive names won't change. As long as we create each global immediately 
after snapshot list, the value of r(snapshot) will correctly refer to that new snapshot.
*************************************************************************************************************************************************************/

* Program to normalize citation counts, YearofPublication, and YearofTranslationtoEnglish to the range [0,1] to accelerate convergence of numerical estimation.
capture program drop NormalizeVariables
program define NormalizeVariables

	* Assert that the sample minimum of each our citations is zero. This ensures that after min-max normalizing,
	* we can still interpret citations as an absolute scale. E.g., 0.5 citations will be double 0.25 citations.
	* If the minimum were not zero, then min-max normalization fails to provide an absolute scale. Then instead,
	* we would have to rescale by dividing every citation by the maximum, so that the maximum is 1, and the new
	* minimum is min/max. This would preserve an absolute scale where 2X implies twice X.
	foreach variable in cite_English cite_German cite_French cite_Spanish cite_news {
		summarize `variable'
		assert r(min) == 0
	}

	foreach variable in cite_English cite_German cite_French cite_Spanish cite_news YearofPublication YearofTranslationtoEnglish {
		capture quietly summarize `variable'
		capture replace `variable' = (`variable' - r(min)) / (r(max)-r(min))
	}
end

* Program to encode a string Name into a number Name_no, and tsset 
* We don't actually call this program until right before we run each regression, because 
* if we number the units and then drop any units, we'll have gaps in the numbering,
* which makes it difficult to tell which unit is hanging or failing to converge if
* one fails to converge.
capture program drop NumberUnits 
program define NumberUnits 
	encode Name, generate(Name_no)
	gen hyphen = "-"
	egen Name_no_Name = concat(Name_no hyphen Name)
	drop hyphen
	tabulate Name_no_Name
	tsset Name_no Year
end

* Create a snapshot of all authors and all years, including Hegel 
clear
cd "$path"
use "$path\data files\all_authors_with_citations_and_indicators.dta"
* If quick bug-testing, restrict the dataset to a small set of authors that includes 
* at least 1 of every value of OriginalLanguage (to avoid the dummy trap in our regressions), 
* all socialists (to avoid tripping the code where we run each socialist's synth, one-by-one),
* and set global convergence to "technique(dfp)".
keep Name Year cite_* Political Socialist OriginalLanguage wrote_* Yearof* author_set 
order Name Year cite_* Political Socialist OriginalLanguage wrote_* Yearof* author_set
if ($quick_bug_testing == 1) {
	keep if author_set == 1 | OriginalLanguage == "Spanish" | OriginalLanguage == "Russian" | Socialist == 1
}
NormalizeVariables
compress
snapshot save, label("All authors, all years, including Hegel")
quietly snapshot list
global all_authors_all_years_inc_Hegel = r(snapshot)

* Create a snapshot of all authors and all years, excluding Hegel, because cannot distinguish
* the father (Georg Wilhelm Friedrich Hegel) and the son (Friedrich Wilhelm Karl, Ritter von Hegel). 
* In Google Ngram, "Friedrich Hegel" and "Karl Hegel" appear with almost equal frequency.
clear
snapshot restore $all_authors_all_years_inc_Hegel
drop if Name == "Hegel"
NormalizeVariables
compress
snapshot save, label("All authors, all years, excluding Hegel")
quietly snapshot list
global all_authors_all_years_excl_Hegel = r(snapshot)

* Snapshot: restrict data to 1878-1932
* For all of our synth regressions, we can specify the x-period for estimations, 
* but then the graph still includes every year in our dataset, which is visually annoying
* (even though it has no affect on our actual results)
* So let's just drop all the years we won't use, so that our graphs will look nice
clear
snapshot restore $all_authors_all_years_excl_Hegel
drop if Year < 1878
drop if Year > 1932
NormalizeVariables
compress
snapshot save, label("All authors (excl Hegel), 1878-1932")
quietly snapshot list
global all_authors_1878_1932 = r(snapshot)

* Snapshot: years are 1878-1932 and treatment is "Karl Marx" so drop "Marx". All languages.
clear
snapshot restore $all_authors_1878_1932
drop if Name == "Marx"
NormalizeVariables
compress
snapshot save, label("Karl Marx, 1878-1932, all languages")
quietly snapshot list
global Karl_Marx_1878_1932 = r(snapshot)

* Snapshot: 
/*
Years are 1878-1932 
Treatment is "Karl Marx" so drop "Marx"
Citations in English, so keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)
*/
clear 
snapshot restore $Karl_Marx_1878_1932
keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)
NormalizeVariables
compress 
snapshot save, label("Karl Marx, 1878-1932, English")
quietly snapshot list
global Karl_Marx_1878_1932_English = r(snapshot)

* Snapshot: treatment is "Marx" so drop "Karl Marx", but otherwise the same as snapshot $Karl_Marx_1878_1932_English.
clear
snapshot restore $all_authors_1878_1932
drop if Name == "Karl Marx"
keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)
NormalizeVariables
compress 
snapshot save, label("Marx, 1878-1932, English")
quietly snapshot list
global Marx_1878_1932_English = r(snapshot)

snapshot list
macro list
disp $all_authors_all_years_inc_Hegel $all_authors_all_years_excl_Hegel $all_authors_1878_1932 $Karl_Marx_1878_1932 $Karl_Marx_1878_1932_English $Marx_1878_1932_English

* Write the current convergence settings to a file 
* We put this line here, in case the global quick_bug_testing has been used to change the global convergence.
capture erase "$path\results\convergence_settings.txt"
file open convergence_settings using "$path\results\convergence_settings.txt", write  
file write convergence_settings "convergence: $convergence" _newline 
file write convergence_settings "convergence for normalization test: $convergence_normalization" _newline
file write convergence_settings "bug-testing mode (1/0) (reduced dataset, sets convergence precision to minimum regardless of previous convergence setting): $quick_bug_testing" _newline 
file close convergence_settings

**************************************************************************************************************************************************************
* Write some programs we will call frequently
**********************************************************************************************************************************************************
* Load two programs from "save_synth_and_synth_runner_output.do": "save_synth_output" and "save_synth_runner_output"
* These two programs will save all output from "synth" and "synth_runner" where the current directory is
do "$path\do files\save_synth_and_synth_runner_output.do"

* Here, we define a baseline regression that we will often use. Whenever we make a slight change to our regression, e.g. by dropping a single donor,
* we can just call this program instead of repeating the same code.
* This regression here assumes treatment in 1917 and a total period of 1878-1932
* It takes the treatment unit # as an argument, so that if we change our sample slightly (e.g. dropping a donor), which changes the treatment unit, we can still use this same regression.
*
* Outcome = cite_English - i.e. annual citations without smoothing (i.e. averaging with adjacent years)
*
* Indicators = YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political
*
* Indicators also = cite_English in various years
* We use cite_English averaged in years X, X+1, and X+2; and then again, averaging in years X+6, X+7, and X+8. 
* This is so that we use data for three years, then we stop tracking for 3 years, and then we track for 3 years again. 
* So half the pre-treatment period is tracked, in a yes/no/yes/no pattern.
*
* Thus, for most of our regressions, we use this list of indicators:
* YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political
* cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886)
* 
* trperiod = 1917 i.e. Russian Revolution
* mspeperiod(1878(1)1916)
* 
* We get our p-values from synth_runner
capture program drop baseline_regression
program define baseline_regression
	
	synth cite_English ///
		YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
		cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
		trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
		fig keep("synth_results") replace
		
		save_synth_output
	
	synth_runner cite_English ///
		YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
		cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
		trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
		noredo_tr_error ///
		gen_vars ///
		keep("synth_runner_results") replace ///
		parallel 
		
		save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean	
end

clear

**************************************************************************************************************************************************************
* Baseline synthetic control. 
**************************************************************************************************************************************************************
* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Create output folder if it does not already exist
cd "$path\results"
capture mkdir "Treatment 1917"
cd "$path\results\Treatment 1917"
baseline_regression

* Manipulate the "Outcomes" tab of the "synth_results.xlsx" spreadsheet, to add percent differences in outcomes.
* Save the results in another sheet, named "Outcomes2".
* 
* Import the "Outcomes" sheet of the "synth_results.xlsx" spreadsheet
preserve
clear 
import excel "$path\results\Treatment 1917\synth_results.xlsx", sheet("Outcomes") cellrange(A4) firstrow
ren A Year 
drop C D
destring Year, replace
gen PctDiff = .
set obs 59
replace Year = 0 in 56 /* Will become a blank line in the spreadsheet */
replace Year = 0 in 57 /* Will become a string named "Averages" in the spreadsheet */
replace Year = 18781916 in 58 /* Will have a hyphen in the spreadsheet */
replace Year = 19171932 in 59 /* Will have a hyphen in the spreadsheet */
mkmat YTreated 
mkmat YSynthetic
mata 

	YTreated = st_matrix("YTreated")
	YSynthetic = st_matrix("YSynthetic")

	YTreated_pretreatment = YTreated[1..39,]
	YSynthetic_pretreatment = YSynthetic[1..39,]
	
	YTreated_posttreatment = YTreated[40..55,]
	YSynthetic_posttreatment = YSynthetic[40..55,]
	
	YTreated_pretreatment_avg = colsum(YTreated_pretreatment)/rows(YTreated_pretreatment)
	YSynthetic_pretreatment_avg = colsum(YSynthetic_pretreatment)/rows(YSynthetic_pretreatment)
	YTreated_posttreatment_avg = colsum(YTreated_posttreatment)/rows(YTreated_posttreatment)
	YSynthetic_posttreatment_avg = colsum(YSynthetic_posttreatment)/rows(YSynthetic_posttreatment)
	
	st_matrix("YTreated_pretreatment_avg",YTreated_pretreatment_avg)
	st_matrix("YSynthetic_pretreatment_avg",YSynthetic_pretreatment_avg)
	st_matrix("YTreated_posttreatment_avg",YTreated_posttreatment_avg)
	st_matrix("YSynthetic_posttreatment_avg",YSynthetic_posttreatment_avg)
end
replace YTreated = YTreated_pretreatment_avg[1,1] if Year == 18781916
replace YSynthetic = YSynthetic_pretreatment_avg[1,1] if Year == 18781916
replace YTreated = YTreated_posttreatment_avg[1,1] if Year == 19171932
replace YSynthetic = YSynthetic_posttreatment_avg[1,1] if Year == 19171932
replace PctDiff = (YTreated-YSynthetic)/YSynthetic*100
mkmat Year YTreated YSynthetic PctDiff
putexcel set "$path\results\Treatment 1917\synth_results.xlsx", modify sheet("Outomes2", replace) 
putexcel A1 = "Year"
putexcel B1 = "Y-Actual"
putexcel C1 = "Y-Synthetic"
putexcel D1 = "% Difference"
putexcel D2 = "(Base = Synthetic)"
putexcel A3 = matrix(Year)
putexcel B3 = matrix(YTreated)
putexcel C3 = matrix(YSynthetic)
putexcel D3 = matrix(PctDiff)
putexcel A58 = ""
putexcel A59 = "Averages"
putexcel A60 = "1878-1916"
putexcel A61 = "1917-1932"
putexcel save
mata: mata clear
restore

**************************************************************************************************************************************************************
* Drop the two donors who received the largest weights in the previous regression
* Instead of using the estimation results from the previous synth, we just read in the Excel output to determine who the largest donors were.
**************************************************************************************************************************************************************
* Preserve the dataset from the first regression to ensure we use the same unit numbering, since we have to drop units by number.
preserve

* Load the synth results from the first regression to find the numbering of the units which received the largest weights.
clear
cd "$path\results\Treatment 1917"
import excel "synth_results.xlsx", sheet("Donor Weights") cellrange(A3) firstrow

ren A Name
ren _Co_Number ID
ren _W_Weight Weight

* Sort descending, so invert, then sort, then invert 
replace Weight = -1 * Weight
sort Weight
replace Weight = -1 * Weight

* The values of Name and ID in rows 1 and 2 are the values of ID for the donor who received the greatest weight 
global greatest_donors_Name1 = Name[1]
	global greatest_donors_Name2 = Name[2]
global greatest_donors_ID1 = ID[1]
	global greatest_donors_ID2 = ID[2]
	
* Use snapshot 4, "Karl Marx, 1878-1932, English"
* Here, we make an exception to our general rule of not calling NumberUnits until after everyone 
* to be dropped has been dropped. We need to make sure that our numbering is the same as that used in 
* the previous regression, because we drop units by number.
* clear
* snapshot restore $Karl_Marx_1878_1932_English
* NumberUnits

* Restore the dataset from the first regression.
restore 

tabulate Name if Name_no == $greatest_donors_ID1 | Name_no == $greatest_donors_ID2
drop if Name_no == $greatest_donors_ID1 | Name_no == $greatest_donors_ID2 

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results"
capture mkdir "Treatment 1917 - drop 2 biggest donors ie $greatest_donors_Name1 & $greatest_donors_Name2"
cd "Treatment 1917 - drop 2 biggest donors ie $greatest_donors_Name1 & $greatest_donors_Name2"
baseline_regression

global greatest_donors_Name1 = ""
global greatest_donors_Name2 = ""
global greatest_donors_ID1 = ""
global greatest_donors_ID1 = ""

**************************************************************************************************************************************************************
* Robustness: include Hegel
* We are not sure that "Hegel" refer to Georg Wilhelm Friedrich Hegel, or whether it refers to his son, Friedrich Wilhelm Karl, Ritter von Hegel.
* Therefore, we have excluded Hegel. But to be sure that excluding Hegel does not bias our results, we include Hegel and repeat our regression.
**************************************************************************************************************************************************************
* Use snapshot 1, "All authors, all years, including Hegel"; then we drop years outside 1878-1932 and drop Name == "Marx"
clear
snapshot restore $all_authors_all_years_inc_Hegel
drop if Year < 1878
drop if Year > 1932
drop if Name == "Marx"
keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)
NormalizeVariables
compress

NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Create output folder if it does not already exist
cd "$path\results"
capture mkdir "Treatment 1917 incl Hegel"
cd "$path\results\Treatment 1917 incl Hegel"
baseline_regression

**************************************************************************************************************************************************************
* Test robustness to choice of V-matrix (indicator variable weights)
*
* A recent paper claims that the SCM algorithm does not yield a unique solution for the V-matrix. Arbitrary reordering of the indicator variables and/or the
* donors can result in a slightly different V-matrix. They propose an algorithm to obtain a unique V, but this algorithm often results in a corner solution, with 
* nearly all or even all the weight being given to a single indicator variable, which is undesirable. Among other solutions, they suggest testing robustness to
* different V matrices, including equal weights.
* 
* "Design Flaw of the Synthetic Control Method" (working paper)
* Kuosmanen, Timo and Zhou, Xun and Eskelinen, Juha and Malo, Pekka
* https://mpra.ub.uni-muenchen.de/106390/14/MPRA_paper_106390.pdf?fbclid=IwAR1sR1sVVRlry9m-BuWtWl5wsZ_BM8kN6DmxpaXcjWO9siLKdD1QpfycaEU
* 28 February 2021
*
* Therefore, we test two choices of V: non-nested (which treats regression-based weights as final), and equal weights.
**************************************************************************************************************************************************************
* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Create output folder if it does not already exist
cd "$path\results"
capture mkdir "Alternative V-matrix"
cd "$path\results\Alternative V-matrix"

* Non-nested 
capture mkdir "Non-nested (regression weights)"
cd "$path\results\Alternative V-matrix\Non-nested (regression weights)"
synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
		
	save_synth_output
	
synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel 
		
	save_synth_runner_output
	
* Cleanup
drop lead cite_English_synth effect pre_rmspe post_rmspe
parallel clean	

* Equal weights
cd ..
capture mkdir "Equal V-weights"
cd "$path\results\Alternative V-matrix\Equal V-weights"
synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace ///
	customV(0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 ///
			0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556)
		
	save_synth_output
	
synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel ///
	customV(0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 ///
			0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556 0.05555555556)
	
	save_synth_runner_output

	
* A related robustness check: does (pseudo-)randomizing the order of the donors and the indicators change the results? 
* Often, it does. But in our case, it does not. After changing the order of the donors and the indicators, we obtain
* the same results as non-nested from above.
*
* Here, we reverse the order of the donors, while keeping Karl Marx's number the same. 
*
* We also change the order of the indicator variables. We have two lists of indicator variables: outcomes and author descriptors. 
* We reverse the order of variables within each list, and then we swap the two lists. 
*
* In the results, the author weights are the same as the non-nested results, but they are labeled with different names than before. 
*
* Nevertheless, we obtain the same results as non-nested. 
*
* Note that the authors' names now appear wrong. We changed their numbers, but Stata is still mapping names to the same numbers
* as before. To compare the two sets of results, compare the non-nested donor weights from the top to these new results from the bottom.
* Since we reversed the numbers of the donors, the two sets of donor weights are the same as long as you reverse them. But ignore the names,
* since the names are wrong.
cd ..
capture mkdir "Change order of authors & indicators"
cd "$path\results\Alternative V-matrix\Change order of authors & indicators"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits
replace Name_no = Name_no - 195 if Name != "Karl Marx"
replace Name_no = -195 if Name_no == 0
replace Name_no = -196 if Name_no == 1
replace Name_no = Name_no * -1 if Name != "Karl Marx"
replace Name_no = 197 if Name_no == $trunit & Name != "Karl Marx" 
sort Year Name_no
tsset Name_no Year
synth cite_English ///
	cite_English(1878(1)1880) cite_English(1884(1)1886) cite_English(1890(1)1892) cite_English(1896(1)1898) cite_English(1902(1)1904) cite_English(1908(1)1910) cite_English(1914(1)1916) ///
	Political Socialist YearofTranslationtoEnglish wrote_Spanish wrote_Italian wrote_Latin wrote_Greek wrote_French wrote_German wrote_English YearofPublication, ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace	
	
	save_synth_output
	
**************************************************************************************************************************************************************
* Citations from newspapers
**************************************************************************************************************************************************************

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

* Keep only the authors who are not missing cite_news.
by Name_no: drop if missing(cite_news)

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Create output folder if it does not already exist
cd "$path\results"
capture mkdir "Newspaper citations"
cd "$path\results\Newspaper citations"

* Since our newspaper dataset is small, we use technique(nr) instead of dfp or bfgs, unless $quick_bug_testing is 1.
global convergence_newspaper = "nested allopt technique(nr)"
if ($quick_bug_testing == 1) {
	global convergence_newspaper = "technique(dfp)"
}

* Check which indicator variables vary within the sample of non-missing values of cite_news.
* It appears we must omit wrote_Greek, wrote_Latin, wrote_Italian, and wrote_Spanish. 
* Of the remaining languages - wrote_English, wrote_German, and wrote_French - we omit one to avoid the dummy trap.
* Usually, we have used Dostoyevsky in Russian in the omitted case, but Dostoyevsky is not present in this sample of cite_news.
summarize YearofPublication YearofTranslationtoEnglish if !missing(cite_news)
foreach variable of varlist OriginalLanguage wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political {
		tabulate `variable' if !missing(cite_news)
	}

synth cite_news ///
	YearofPublication wrote_English wrote_German YearofTranslationtoEnglish Socialist Political ///
	cite_news(1914(1)1916) cite_news(1908(1)1910) cite_news(1902(1)1904) cite_news(1896(1)1898) cite_news(1890(1)1892) cite_news(1884(1)1886) cite_news(1878(1)1880), $convergence_newspaper ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
		
	save_synth_output
	
synth_runner cite_news ///
	YearofPublication wrote_English wrote_German YearofTranslationtoEnglish Socialist Political ///
	cite_news(1914(1)1916) cite_news(1908(1)1910) cite_news(1902(1)1904) cite_news(1896(1)1898) cite_news(1890(1)1892) cite_news(1884(1)1886) cite_news(1878(1)1880), $convergence_newspaper ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel 
		
	save_synth_runner_output
	
* Cleanup
drop lead cite_news_synth effect pre_rmspe post_rmspe
parallel clean	

* Test if newspaper citations are fairly well correlated with Ngram citations.
* We hypothesize that they have different levels, but similar rates of change, so 
* we perform a log-log regression with fixed effects.
generate log_cite_English = log(cite_English)
generate log_cite_news = log(cite_news)
xtreg log_cite_news log_cite_English i.Year, fe

* Graph citations from Ngram and newspapers for Marx together, so that we can visually see if the two measurements are fairly similar,
* meaning they are measuring the same thing.
keep if Name == "Karl Marx"
sort Year 
twoway (connected cite_English Year) (connected cite_news Year)
graph save "Graph" "Ngram & Newspaper.com citations for ''Karl Marx''.gph", replace

clear
	
**************************************************************************************************************************************************************
* Robustness: using citations in German
* We don't have years of translation to German for use as an indicator, and YearofTranslationtoEnglish is not included as an indicator.
* Obviously, we change our outcome and indicators from cite_English to cite_German.
*
* We use two different datasets: one with all authors, and one with only German authors.
*
* We run synth_runner twice for each dataset (all authors, Germans-only), to obtain two sets of p-values: one for the period until 1932, and one for the period until 1950.
* Since synth is run just for the visual graph, we run it only once, for the period until 1950, and this lets us visually see what happens by 1932 too.
**************************************************************************************************************************************************************

*************************************
* All authors, including non-Germans
*************************************

clear
snapshot restore $all_authors_all_years_excl_Hegel
drop if Year < 1878
drop if Year > 1950
drop if Name == "Marx"
drop if missing(cite_German)
NormalizeVariables
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1917, German citations"
cd "$path\results\Treatment 1917, German citations"
capture mkdir "All authors, including non-Germans"
cd "$path\results\Treatment 1917, German citations\All authors, including non-Germans"

synth cite_German ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political /// 
	cite_German(1914(1)1916) cite_German(1908(1)1910) cite_German(1902(1)1904) cite_German(1896(1)1898) cite_German(1890(1)1892) cite_German(1884(1)1886) cite_German(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1950) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
			
	save_synth_output
	
* We run the same synth_runner twice, just with different post-treatment years in memory
capture program drop German_synth_runner
program define German_synth_runner

	synth_runner cite_German ///
		YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political ///
		cite_German(1914(1)1916) cite_German(1908(1)1910) cite_German(1902(1)1904) cite_German(1896(1)1898) cite_German(1890(1)1892) cite_German(1884(1)1886) cite_German(1878(1)1880), $convergence ///
		trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
		noredo_tr_error ///
		gen_vars ///
		keep("synth_runner_results") replace ///
		parallel 
			
		save_synth_runner_output
		
		* Cleanup
		drop lead cite_German_synth effect pre_rmspe post_rmspe
		parallel clean	
		
end
	
* synth_runner to get p-values for the whole period to 1950
capture mkdir "synth_runner to 1950"
cd "$path\results\Treatment 1917, German citations\All authors, including non-Germans\synth_runner to 1950"
German_synth_runner 
	
* synth_runner to get p-values for only the period to 1932
cd "$path\results\Treatment 1917, German citations\All authors, including non-Germans"
capture mkdir "synth_runner to 1932"
cd "$path\results\Treatment 1917, German citations\All authors, including non-Germans\synth_runner to 1932"
drop if Year > 1932 
German_synth_runner

*************************************
* German authors only
* Since all these authors have wrote_German == 1, we eliminate the language indicator variables
*************************************

clear
snapshot restore $all_authors_all_years_excl_Hegel
drop if Year < 1878
drop if Year > 1950
drop if Name == "Marx"
drop if missing(cite_German)
keep if wrote_German == 1
NormalizeVariables
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1917, German citations"
cd "$path\results\Treatment 1917, German citations"
capture mkdir "German authors only"
cd "$path\results\Treatment 1917, German citations\German authors only"

synth cite_German ///
	YearofPublication Socialist Political /// 
	cite_German(1914(1)1916) cite_German(1908(1)1910) cite_German(1902(1)1904) cite_German(1896(1)1898) cite_German(1890(1)1892) cite_German(1884(1)1886) cite_German(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1950) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
			
	save_synth_output
	
* We run the same synth_runner twice, just with different post-treatment years in memory
capture program drop German_synth_runner
program define German_synth_runner

	synth_runner cite_German ///
		YearofPublication Socialist Political ///
		cite_German(1914(1)1916) cite_German(1908(1)1910) cite_German(1902(1)1904) cite_German(1896(1)1898) cite_German(1890(1)1892) cite_German(1884(1)1886) cite_German(1878(1)1880), $convergence ///
		trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
		noredo_tr_error ///
		gen_vars ///
		keep("synth_runner_results") replace ///
		parallel 
			
		save_synth_runner_output
		
		* Cleanup
		drop lead cite_German_synth effect pre_rmspe post_rmspe
		parallel clean	
		
end
	
* synth_runner to get p-values for the whole period to 1950
capture mkdir "synth_runner to 1950"
cd "$path\results\Treatment 1917, German citations\German authors only\synth_runner to 1950"
German_synth_runner	
	
* synth_runner to get p-values for only the period to 1932
cd "$path\results\Treatment 1917, German citations\German authors only"
capture mkdir "synth_runner to 1932"
cd "$path\results\Treatment 1917, German citations\German authors only\synth_runner to 1932"
drop if Year > 1932 
German_synth_runner	
	
**************************************************************************************************************************************************************
* Robustness: using citations in French
* We don't have years of translation to French for use as an indicator, and YearofTranslationtoEnglish is not included as an indicator.
* Obviously, we change our outcome and indicators from cite_English to cite_French.
**************************************************************************************************************************************************************	
	
clear
snapshot restore $all_authors_all_years_excl_Hegel
drop if Year < 1878
drop if Year > 1932
drop if Name == "Marx"
drop if missing(cite_French)
NormalizeVariables
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)
	
cd "$path\results\"
capture mkdir "Treatment 1917, French citations"
cd "$path\results\Treatment 1917, French citations"	
synth cite_French ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political ///
	cite_French(1914(1)1916) cite_French(1908(1)1910) cite_French(1902(1)1904) cite_French(1896(1)1898) cite_French(1890(1)1892) cite_French(1884(1)1886) cite_French(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
		
	save_synth_output
	
synth_runner cite_French ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political ///
	cite_French(1914(1)1916) cite_French(1908(1)1910) cite_French(1902(1)1904) cite_French(1896(1)1898) cite_French(1890(1)1892) cite_French(1884(1)1886) cite_French(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel 
		
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_French_synth effect pre_rmspe post_rmspe
	parallel clean	

**************************************************************************************************************************************************************
* Robustness: using citations in Spanish
* We don't have years of translation to Spanish for use as an indicator, and YearofTranslationtoEnglish is not included as an indicator.
* Obviously, we change our outcome and indicators from cite_English to cite_Spanish.
**************************************************************************************************************************************************************	
	
clear
snapshot restore $all_authors_all_years_excl_Hegel
drop if Year < 1878
drop if Year > 1932
drop if Name == "Marx"
drop if missing(cite_Spanish)
NormalizeVariables
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)
	
cd "$path\results\"
capture mkdir "Treatment 1917, Spanish citations"
cd "$path\results\Treatment 1917, Spanish citations"	
synth cite_Spanish ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political ///
	cite_Spanish(1914(1)1916) cite_Spanish(1908(1)1910) cite_Spanish(1902(1)1904) cite_Spanish(1896(1)1898) cite_Spanish(1890(1)1892) cite_Spanish(1884(1)1886) cite_Spanish(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace
		
	save_synth_output
	
synth_runner cite_Spanish ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish Socialist Political ///
	cite_Spanish(1914(1)1916) cite_Spanish(1908(1)1910) cite_Spanish(1902(1)1904) cite_Spanish(1896(1)1898) cite_Spanish(1890(1)1892) cite_Spanish(1884(1)1886) cite_Spanish(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel 
		
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_Spanish_synth effect pre_rmspe post_rmspe
	parallel clean		
	
*******************************************************************************
* Placebo test:
* What was happening to Marx's chief counterfactuals, the other socialists?
* Did they have treatment effects?
* Let's run synths on them, just to see what they looked like, visually, and make sure nothing funny is going on with them.
*
* In all these, we exclude both "Karl Marx" and "Marx" from the donor pool
*******************************************************************************

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

* Drop "Karl Marx" too, because we don't want his treatment effect (assuming one exists) to bias the estimates of his counterfactuals.
drop if Name == "Karl Marx"
tabulate Name if Socialist == 1

cd "$path\results\"
capture mkdir "Treatment 1917, socialist donors only"
cd "$path\results\Treatment 1917, socialist donors only"
capture mkdir "synth for socialist counterparts"
cd "$path\results\Treatment 1917, socialist donors only\synth for socialist counterparts"

* We will be using the parallel module to run a program that will run several synths in parallel. However, 
* whenever we try to use the current dataset in memory for multiple processes, parallel fails. Instead,
* we have to use the "nodata" argument in parallel, and then let each individual process load its own data. 
* Therefore, we save the data to disk, so that each process can load the data independently. At the end,
* we will erase this temporary data.
save "socialist_synth_tempdata.dta", replace

* This program is simply a synth regression for any given, individual, arbitrary socialist donor unit.
capture program drop SocialistCounterfactualSynth 
program SocialistCounterfactualSynth

	args Name 
	
	capture mkdir "`Name'"
	cd "$path\results\Treatment 1917, socialist donors only\synth for socialist counterparts\\`Name'" 
	quietly sum Name_no if Name == "`Name'"
	local trunit = r(mean)
	
	synth cite_English ///
		YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
		cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
		trunit(`trunit') trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
		fig keep("synth_results") replace
		
		save_synth_output

end

* We create a program RunSocialistSynthsInParallel in which a series of conditional "if $pll_instance = # {}" is used to run multiple code blocks in parallel.
* Then this program is executed by parallel, programs(...): RunSocialistSynthsInParallel. See below for commentary on the parameters of parallel, programs(...): RunSocialistSynthsInParallel.
* For an example of this, see "help parallel", s.v. "Example 6: Using -pll_instance- and -PLL_CHILDREN- macros".
capture program drop RunSocialistSynthsInParallel
program define RunSocialistSynthsInParallel

	use "socialist_synth_tempdata.dta"
	
	if ($pll_instance == 1) {
		SocialistCounterfactualSynth "August Bebel"
	}
	else if ($pll_instance == 2) {
		SocialistCounterfactualSynth "Bakunin"
	}
	else if ($pll_instance == 3) {
		SocialistCounterfactualSynth "Blanqui"
	}
	else if ($pll_instance == 4) {
		SocialistCounterfactualSynth "Charles Fourier"
	}
	else if ($pll_instance == 5) {
		SocialistCounterfactualSynth "Eduard Bernstein"
	}
	else if ($pll_instance == 6) {
		SocialistCounterfactualSynth "Edward Bellamy"
	}
	else if ($pll_instance == 7) {
		SocialistCounterfactualSynth "Ferdinand Lassalle"
	}
	else if ($pll_instance == 8) {
		SocialistCounterfactualSynth "Henry George"
	}
	else if ($pll_instance == 9) {
		SocialistCounterfactualSynth "John Ruskin"
	}
	else if ($pll_instance == 10) {
		SocialistCounterfactualSynth "Kropotkin"
	}
	else if ($pll_instance == 11) {
		SocialistCounterfactualSynth "Lujo Brentano"
	}
	else if ($pll_instance == 12) {
		SocialistCounterfactualSynth "Moses Hess"
	}
	else if ($pll_instance == 13) {
		SocialistCounterfactualSynth "Oscar Wilde"
	}
	else if ($pll_instance == 14) {
		SocialistCounterfactualSynth "Proudhon"
	}
	else if ($pll_instance == 15) {	
		SocialistCounterfactualSynth "Robert Owen"
	}
	else if ($pll_instance == 16) {
		SocialistCounterfactualSynth "Rodbertus"
	}
	else if ($pll_instance == 17) {
		SocialistCounterfactualSynth "Sismondi"
	}
	else if ($pll_instance == 18) {
		SocialistCounterfactualSynth "Sombart"
	}
	else if ($pll_instance == 19) {
		SocialistCounterfactualSynth "Thomas Carlyle"
	}

end

* Run the program we just created - RunSocialistSynthsInParallel - in parallel.
* I found that when each thread of RunSocialistSynthsInParallel tried to access the data in memory, parallel would fail with errors.
* So instead, we use the "nodata" parameter and then let each individual thread load its own data - using the temporary data we already saved to disk.
* Finally, programs(...) is passed the name of every program which will be required by any of the threads of RunSocialistSynthsInParallel.
parallel, nodata programs(RunSocialistSynthsInParallel SocialistCounterfactualSynth save_synth_output NormalizeVariables NumberUnits): RunSocialistSynthsInParallel
parallel clean
erase "$path\results\Treatment 1917, socialist donors only\synth for socialist counterparts\socialist_synth_tempdata.dta"	

**************************************************************************************************************************************************************
* Robustness: include only socialists in our sample, for synth_runner (for p-values).
* Now we drop the Socialist and Political indicator variables because all our donors are socialists, and all socialists are political
* We also have to drop wrote_Greek, wrote_Latin, wrote_Spanish, and wrote_Italian because none of the socialists wrote in either of these languages.
* And we drop wrote_French because of the dummy trap; every socialist who didn't write in English or German must have written in French, so we don't need wrote_French
**************************************************************************************************************************************************************	
* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
* Keep only socialists
keep if Socialist == 1
NormalizeVariables
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\Treatment 1917, socialist donors only"

synth cite_English ///
	YearofPublication wrote_English wrote_German YearofTranslationtoEnglish ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace 
		
	save_synth_output
	
synth_runner cite_English ///
	YearofPublication wrote_English wrote_German YearofTranslationtoEnglish ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace /// 
	parallel
		
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean	
	
**************************************************************************************************************************************************************
* Robustness: include only NON-socialists in our sample (besides Karl Marx himself, whom we obviously keep)
* Now we drop the Socialist indicator but keep the Political indicator, because all our donors are non-socialists, but some are Political while some are not
**************************************************************************************************************************************************************	
clear
* Use snapshot 4, "Karl Marx, 1878-1932, English"
snapshot restore $Karl_Marx_1878_1932_English
* Keep only "Karl Marx" and non-socialists
keep if Name == "Karl Marx" | Socialist == 0
NormalizeVariables
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results"
capture mkdir "Treatment 1917, NON-socialist donors only"
cd "Treatment 1917, NON-socialist donors only"

synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1878(1)1916) ///
	fig keep("synth_results") replace 
		
	save_synth_output
	
synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace /// 
	parallel 
		
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean	
	
**************************************************************************************************************************************************************
* Robustness: divide pre-treatment period in half between "training" and "validation" periods.
* We use only the first half the pre-treatment period as 
* indicators, and minimize the RMSPE in the second half of the pre-treatment period. 
* Intuitively, we are mitigating overfitting by choosing a set of weights such that
* indicators from one part of the pre-treatment period are able to predict outcomes
* in another part of the pre-treatment period.
**************************************************************************************************************************************************************
* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1917, training & validation"
cd "$path\results\Treatment 1917, training & validation"

synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
																				cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) resultsperiod(1878(1)1932) mspeperiod(1899(1)1916) ///
	fig keep("synth_results") replace 
	
	save_synth_output

synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
																				cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1899(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel
	
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean

**************************************************************************************************************************************************************
* Robustness: change Henry George from socialist to not. But otherwise same regressions as the first one.
**************************************************************************************************************************************************************	

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Change Henry George to a non-socialist
preserve
replace Socialist = 0 if Name == "Henry George"

cd "$path\results\"
capture mkdir "Treatment 1917, change Henry George to non-socialist"
cd "$path\results\Treatment 1917, change Henry George to non-socialist"

baseline_regression
	
* Change Henry George back
restore

clear
	
**************************************************************************************************************************************************************	
* Baseline results, but using "Marx" instead of "Karl Marx"
* On Google Ngrams, "Marx" and "Karl Marx" have essentially the same shape, so this is essentially testing whether using a different level with the same
* rate of change produces the same results. We hope to get the same result, because we cannot identify levels, because we don't know whether "Marx" or 
* "Karl Marx" is more correct. For the same reason, later on, we will normalize outcomes to 1 in an arbitrary year to see if we get the same results.
*
* We have to reload the data and keep "Marx" instead of keeping "Karl Marx."
* Treatment unit is 33 now.
**************************************************************************************************************************************************************
* Use snapshot 5, "Marx, 1878-1932, English"
clear
snapshot restore $Marx_1878_1932_English
NumberUnits

quietly sum Name_no if Name == "Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1917, using ''Marx''"
cd "$path\results\Treatment 1917, using ''Marx''"

baseline_regression

*******************************************************************************
* In-time placebo test: let the treatment be 1889, so that 15 years later is
* just prior to the 1905 Russian Revolution, which was the "dress rehearsal"
* of the 1917 revolution.
*
* Reload the data with different years.
*******************************************************************************
* Use snapshot 2, "All authors, all years"
clear
snapshot restore $all_authors_all_years_excl_Hegel
* Keep "Karl Marx" and drop "Marx"
drop if Name == "Marx"

* Keep only relevant years
drop if Year < 1850
drop if Year > 1904

keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)

* We start tracking our in-time placebo at 1850, i.e. 39 years before our
* treatment, just as our previous regressions began the pre-treatment period in 
* 1878, 39 years prior to our treatment date (1917). 
*
* Because our pre-treatment period begins so early, a few authors lack non-zero 
* outcomes. This may make it difficult to form synthetic counterfactuals for these 
* authors, making the pre-treatment RMSPE large. 
*
* Therefore, we entirely drop any authors whose first 10 years - 1850 to 1859 - 
* have more than 5 zero outcomes (half of the period).
*
* To drop observations, we have to refer to row numbers. To ensure that every 
* observation's row numbers correspond to the same years, we assert that the 
* dataset is strongly balanced. This should be true, or else synth will have 
* previously failed, but we test it anyway just to be sure.
*
* In addition, we don't want to call NumberUnits until after every unit that will 
* be dropped is dropped, because once units are numbered (encoded), those numbers
* become permanent somehow, even if Name_no is dropped and re-encoded. So to tsset,
* we generate a temporary numbering.
* 
encode Name, generate(temp_Name_no)
tsset temp_Name_no Year
sort temp_Name_no Year
assert r(balanced) == "strongly balanced"
gen zero_outcome = .
replace zero_outcome = 0 if cite_English > 0.0 & Year <= 1859
replace zero_outcome = 1 if cite_English <= 0.0 & Year <= 1859
by temp_Name_no: egen total_zero_outcome = total(zero_outcome)
tabulate Name if total_zero_outcome > 5
drop if total_zero_outcome > 5
drop temp_Name_no zero_outcome total_zero_outcome

NormalizeVariables	
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1889 (Placebo)"
cd "$path\results\Treatment 1889 (Placebo)"

synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1886(1)1888) cite_English(1880(1)1882) cite_English(1874(1)1876) cite_English(1868(1)1870) cite_English(1862(1)1864) cite_English(1856(1)1858) cite_English(1850(1)1852), $convergence ///
	trunit($trunit) trperiod(1889) resultsperiod(1850(1)1904) mspeperiod(1850(1)1888) ///
	fig keep("synth_results") replace 
	
	save_synth_output

synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1886(1)1888) cite_English(1880(1)1882) cite_English(1874(1)1876) cite_English(1868(1)1870) cite_English(1862(1)1864) cite_English(1856(1)1858) cite_English(1850(1)1852), $convergence ///
	trunit($trunit) trperiod(1889) mspeperiod(1850(1)1888) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel 

	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean	

**************************************************************************************************************************************************************	
* In-time placebo, using 1905 as the treatment period - for the 1905 Russian Revolution
* 
* Once, I (Michael Makovi) had arbitarily used 1901 as the treatment period for an in-time placebo, simply because its 15 year treatment period would end just 
* prior to 1917. It was a convenient way to move the window back. But I got a statistically significant treatment effect, and I was worried. I asked, "Did anything 
* noteworthy happen in 1905?", and that's when I learned about the 1905 Russian Revolution. And then I realized it was a confounder.
*
* Since then, we've gotten additional donors in our pool, so now we are testing the 1905 Russian Revolution deliberately and explicitly. Having seen the 1901 results,
* with the smaller sample, and having reconsidered the 1905 Russian Revolution, we formed this hypothesis: that the 1905 Russian Revolution had a treatment effect but 
* that it dissipated and that by 1917, Marx had mostly returned to trend. We test this hypothesis using a larger sample, including donors that were not in our pool 
* when we first saw the 1901 results.
*
* We should expect the 1905 Russian Revolution to have some effect for the exact same reason as the 1917 revolution. However, because the 1905 revolution was ultimately
* abortive, while the 1917 had enduring effects, we expect that the treatment effect of 1905 will become statistically insignificant by 1917.
*
* So our prediction is this: the joint post std p-value (post-RMSPE over pre-RMSPE ratio) will be significant, but the final years will not be individually significant,
* and the treatment effect by 1917 will be very small, as Marx returns to trend.
**************************************************************************************************************************************************************
* Use snapshot 2, "All authors, all years"
clear
snapshot restore $all_authors_all_years_excl_Hegel
* Keep "Marx" and drop "Karl Marx"
drop if Name == "Marx"
* Keep only the relevant years - move our window back to 1860-1916
drop if Year < 1860
drop if Year > 1916

keep if !missing(cite_English) & !missing(YearofTranslationtoEnglish)

NormalizeVariables
NumberUnits

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

cd "$path\results\"
capture mkdir "Treatment 1905 Revolution"
cd "$path\results\Treatment 1905 Revolution"

synth cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880) cite_English(1872(1)1874) cite_English(1866(1)1868) cite_English(1860(1)1862), $convergence ///
	trunit($trunit) trperiod(1905) resultsperiod(1860(1)1916) mspeperiod(1860(1)1904) ///
	fig keep("synth_results") replace 

	save_synth_output
	
synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880) cite_English(1872(1)1874) cite_English(1866(1)1868) cite_English(1860(1)1862), $convergence ///
	trunit($trunit) trperiod(1905) mspeperiod(1860(1)1904) ///
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel

	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe
	parallel clean				

*******************************************************************************
* Regressions on trends
* One problem we have is that we aren't sure we are identifying levels. E.g., we use "Karl Marx"
* but "Rodbertus." So we are mixing lastname-only with firstname-lastname. We can't be consistent
* because some people have common lastnames - e.g. Adam Smith - while other people have 
* inconsistent or varying full names - e.g. Pierre-Joseph Proudhon with and without the hyphen
* and with and without the middle name "Joseph."
*
* So we can't be sure we are identifying levels properly. Our identification strategy is 
* based on the assumption that synthetic control can match trends (rates of growth) even if
* the levels are mismeasured.
*
* But if that's true, then we should get the same results if we re-scale everyone arbitrarily.
* Now, synth_runner has a trends option, which is supposed to normalize everyone to an outcome
* of 1 in the last pre-treatment year. But synth lacks any similar option. Furthermore, we want 
* to try normalizing in different years. So we run a set of iterations of our baseline regression 
* in which we arbitrarily scale everyone's outcome variable to be equal to one in each of the 
* pre-treatment years.
*******************************************************************************

* Reduce the level of convergence from $convergence to $convergence_normalization
* Since we will be calling baseline_regression, which has already been defined to use
* $convergence, we simply change the value of $convergence to the value of $convergence_normalization
* The original value of $convergence has already been saved in $convergence_backup, so we could always
* change it back if necessary.
global convergence = "$convergence_normalization"

cd "$path\results\"
capture mkdir "normalize"
cd "$path\results\normalize"

* Because we run so many normalizations in different years, we want to save some summary results of all the different normalizations, in a single spreadsheet
* This spreadsheet will have already been initialized (e.g. with column names) when this program is called
* We save the number of placebos, the joint post std p-value, the AEP p-value, and the Simes p-value for each normalization
capture program drop save_normalization_summary
program define save_normalization_summary
	
	args year

	cd "$path\results\normalize"
	putexcel set "normalization_results.xlsx", modify sheet(Normalization results)	
	
	matrix no_placebos = e(n_pl)
	matrix joint_post_std_p_value = e(pval_joint_post_std)
	matrix pval_std_aep = e(pval_std_AEP)
	matrix pval_std_simes = e(pval_std_simes)
	putexcel (A$i) = `year'
	putexcel (B$i) = no_placebos[1,1], names overwritefmt
	putexcel (C$i) = joint_post_std_p_value[1,1], names overwritefmt
	putexcel (D$i) = pval_std_aep[1,1], names overwritefmt
	putexcel (E$i) = pval_std_simes[1,1], names overwritefmt
	
	* Obtain the percent difference in outcomes for (1) the final post-treatment year, and (2) the entire post-treatment period.
	* We treat the synthetic outcome as the base, because it is the "true" value against which differences should be evaluated.
	*
	* First, percent difference in the final post-treatment year.
	matrix outcomes = e(treat_control)
	local last_row_number = rowsof(outcomes)
	local pct_diff_final_year = (outcomes[`last_row_number',1]-outcomes[`last_row_number',2]) / outcomes[`last_row_number',2] * 100
	putexcel (F$i) = `pct_diff_final_year', names overwritefmt
	* 
	* Second, percent difference for the entire post-treatment period.
	local num_of_post_treat_outcomes = colsof(e(b))
	matrix post_treatment_outcomes = e(treat_control)["post1".."post`num_of_post_treat_outcomes'",1...]
	mata: post_treatment_outcomes = st_matrix("post_treatment_outcomes")
		mata: sums = colsum(post_treatment_outcomes)
		mata: pct_diff_avg = (sums[1,1]-sums[1,2])/sums[1,2]*100
		mata: st_matrix("pct_diff_avg",pct_diff_avg)
	local pct_diff_avg = pct_diff_avg[1,1]
	putexcel (G$i) = `pct_diff_avg', names overwritefmt
	mata: mata clear
	
end

* This program will perform a baseline SCM regression that normalizes outcomes to 1 in any arbitrary year, where the year is an argument
capture program drop normalized_regression
program define normalized_regression
	
	args year
	
	* Use snapshot 4, "Karl Marx, 1878-1932, English"
	clear
	snapshot restore $Karl_Marx_1878_1932_English
	NumberUnits
	quietly sum Name_no if Name == "Karl Marx"
	global trunit = r(mean)
	
	* We are going to be normalizing cite_English, so preserve the dataset as it was before normalization
	* This is unnecessary due to the snapshot restore we run every time, but we use preserve and restore too just to be safe
	* preserve
	
	* cite_English_in_`year' is equal to cite_English[X], where X is a row number
	* Within a bysort Name (Year), row number X is determined by counting from row 1 in 1878 until 
	* we get to `year'. For example 1881 is row 4. More generally, the first year is Year[1] 
	* rather than 1878. Therefore, (`year' - Year[1] + 1) is the row number of `year', given
	* that row 1 is Year[1]. Thus, cite_English_in_`year' = cite_English[`year' - Year[1] + 1]
	bysort Name (Year): gen cite_English_in_`year' = cite_English[`year' - Year[1] + 1]

	* We're going to be normalizing by dividing cite_English by the value of cite_English_in_`year'
	* The problem is that a few observations (approximately 10 obs) have 
	* values of cite_English_in_`year' of exactly 0 (zero), so they create division by zero errors. 
	* To solve this problem, I'm going to replace zeroes with values equal to the smallest 
	* non-zero value (in any year) for that particular author.
	*
	* We have to perform two steps to find the minimum non-zero value of min_cite_English
	* First, "bysort Name: egen min_cite_English = min(cite_English) if cite_English > 0" 
	* finds the minimum non-zero value, but only for the observations which have a non-zero value.
	* That is, for the observations for which cite_English == 0, min_cite_English 
	* has a missing value. So then we have to find the mean value of min_cite_English.
	* The mean value will be the mean of missing values and of identical present values,
	* and thus the mean value will be equal to the present values - but now every 
	* observation has that value, including the observations for which cite_English == 0.
	bysort Name: egen min_cite_English = min(cite_English) if cite_English > 0
	bysort Name: egen min_cite_English_m = mean(min_cite_English)
	replace cite_English_in_`year' = min_cite_English_m if cite_English_in_`year' == 0
	
	gen cite_English_norm = cite_English / cite_English_in_`year'
	
	* replace cite_English with cite_English_norm so that we can call the baseline_regression program
	replace cite_English = cite_English_norm 
	
	* After normalizing all authors have to outcome == 1 in a given arbitrary year, to ensure that every author is on the same level (in that year),
	* we re-normalize so that all variables fall in the [0,1] interval, to accelerate convergence.
	NormalizeVariables
	
	cd "$path\results\normalize"
	capture mkdir "Treatment 1917 - normalize `year'"
	cd "$path\results\normalize\Treatment 1917 - normalize `year'"
	
	* We capture the SCM regression because sometimes, it fails to converge, which throws an error and halts termination.
	* If _rc == 0 i.e. there is no error, then add the result to the list of results by calling the program save_normalization_summary
	* But if _rc != 0 i.e. there is an error then just write the year to the results, but leave everything else (no. placebos, p-value) blank
	capture noisily {
		baseline_regression
	}
	if _rc == 0 {
		save_normalization_summary `year'
	}
	else if _rc != 0 {
		cd "$path\results\normalize"
		putexcel set "normalization_results.xlsx", modify sheet(Normalization results)
		putexcel (A$i) = `year'
	}
	
	* Restore the dataset as it was before we normalized cite_English
	* restore 

end

************************************************** 	
* Regressions, normalize every year
************************************************** 

cd "$path\results\normalize"

* Initialize the normalization summary results
putexcel set "normalization_results.xlsx", replace sheet(Normalization results)
putexcel (A1) = "Normalization year"
putexcel (B1) = "No. placebos"
putexcel (C1) = "Joint post std p"
putexcel (D1) = "AEP p-value"
putexcel (E1) = "Simes p-value"
putexcel (F1) = "Pct diff btw treated & synth - 1932"
putexcel (G1) = "Pct diff btw treated & synth - 1917-32"

* i indicates the row number that we will be saving the results in a spreadsheet
* In each iteration of the loop, we save the year, the number of placeobs, and the joint-post-std p-value in a new row of the spreadsheet
* For some reason, putexcel would sometimes fail when i was a local, so I made it a global instead
global i = 2
global step = 1
if ($quick_bug_testing == 1) {
	global step = 9
}
foreach year of numlist 1878($step)1916 {
	* The following line, which is commented out, is useful for testing whether this code successfully accounts for iterations of SCM that fail
	* to converge. In the program normalized_regression, when there is a failure to converge, then instead of calling save_normalization_summary,
	* the program will save the year but neither a number of placebos nor a p-value. The result will be a table in which each row is an attempted test,
	* the columns are year, no. placebos, and p-value, and some rows have values in all three columns but some rows have only a value of year
	* and a blank for no. placebos and p-value. Then, when we perform meta-analysis below, the rows with blanks for no. placebos and/or p-value
	* will be ignored. Because SCM takes so long to run, it is difficult to test this code. The following foreach loop will deliberately cause
	* errors by setting convergence to "" (to avoid the time taken by nested) and then calling normalized_regression for years in which there 
	* is no data at all - which will cause SCM to fail similarly to the way it fails when it cannot achieve convergence. 
	*
	* global convergence = ""
	* foreach year of numlist 1890 2000 2001 1892 {

	normalized_regression `year'

	global i = $i + 1
	
}

program drop normalized_regression
	
/*
************************************************** 	
* Baseline regression, except this time, we use synth_runner's "trends" option to normalize in 1916. 
*
* According to the help file, "trends will force synth to match on the trends in 
* the outcome variable. It does this by scaling each unit's outcome variable so 
* that it is 1 in the last pre-treatment period."
************************************************** 
cd "$path\results\normalize"
capture mkdir "Treatment 1917 - synth_runner trends"
cd "$path\results\normalize\Treatment 1917 - synth_runner trends"

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

capture noisily {
	synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel trends
	
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe cite_English_scaled cite_English_scaled_synth effect_scaled
	parallel clean	
}
if _rc == 0 {
	* Year argument is a number, so for now, we set the year to 9999, and later, we will replace that spreadsheet cell with the string "1916 ''Trends''"
	save_normalization_summary 9999
}
else if _rc != 0 {
}

cd "$path\results\normalize"
putexcel set "normalization_results.xlsx", modify sheet(Normalization results)
* replace the year 9999 with "Trends" in the normalization summary spreadsheet
putexcel (A$i) = "1916 ''Trends''"
*/
	
************************************************** 
* Meta-analysis of our normalized regressions
* We read the spreadsheet where we just saved all the results,
* count how many results were statistically significant, and 
* save that to the same spreadsheet.
************************************************** 
clear
cd "$path\results\normalize"
import excel "normalization_results.xlsx", sheet("Normalization results") firstrow
ren Pctdiffbtwtreatedsynth1 Pctdiffbtwtreatedsynth1932
ren G 						Pctdiffbtwtreatedsynth1917_32

* We have some years in which SCM could not converge. We want to distinguish between
* the number of years in which a test was attempted, and the number of successful tests.
* If a year was attempted, there will be a value of Normalizationyear, but there may 
* not be a value of Noplacebos and/or Jointpoststdp

* First, drop any rows that don't have a value for Normalizationyear. These rows - if
* they exist - have no meaningful data. The remaining observations indicate an
* attempted test, whether or not it converged and produced results.
drop if missing(Normalizationyear)
gen num_attempted_tests = _N

* Now drop any rows which have no value for Noplacebos or Jointpoststdp. These 
* represent tests which did not converge and therefore failed to produce any results
drop if missing(Noplacebos) | missing(Jointpoststdp)
gen num_completed_tests = _N

destring Jointpoststdp, replace
egen sig_01_Jointpoststdp = total(Jointpoststdp < 0.01)
egen sig_05_Jointpoststdp = total(Jointpoststdp < 0.05)
egen sig_10_Jointpoststdp = total(Jointpoststdp < 0.10)
egen not_sig_Jointpoststdp = total(Jointpoststdp >= 0.10)

destring AEPpvalue, replace
egen sig_01_AEPpvalue = total(AEPpvalue < 0.01)
egen sig_05_AEPpvalue = total(AEPpvalue < 0.05)
egen sig_10_AEPpvalue = total(AEPpvalue < 0.10)
egen not_sig_AEPpvalue = total(AEPpvalue >= 0.10)

destring Simespvalue, replace
egen sig_01_Simespvalue = total(Simespvalue < 0.01)
egen sig_05_Simespvalue = total(Simespvalue < 0.05)
egen sig_10_Simespvalue = total(Simespvalue < 0.10)
egen not_sig_Simespvalue = total(Simespvalue >= 0.10)

putexcel set "normalization_results.xlsx", modify sheet(Normalization results)
* We start at row num_attempted_tests[1]+3 because in the spreadshoot, row 1 is the variable names, so row num_attempted_tests[1]+1 is the last observation 
* Then, we want a blank row at num_attempted_tests[1]+2, so we start at num_attempted_tests[1]+3
global i = num_attempted_tests[1] + 3
quietly mean Pctdiffbtwtreatedsynth1932
matrix mean_results_1932 = r(table)
quietly mean Pctdiffbtwtreatedsynth1917_32
matrix mean_results_1917_32 = r(table)
putexcel (A$i) = "Mean pct diff:"
putexcel (F$i) = mean_results_1932[1,1], names overwritefmt
putexcel (G$i) = mean_results_1917_32[1,1], names overwritefmt
global i = $i + 2
putexcel (A$i) = "No. tests:"
putexcel (C$i) = num_completed_tests[1], names overwritefmt
putexcel (D$i) = num_completed_tests[1], names overwritefmt
putexcel (E$i) = num_completed_tests[1], names overwritefmt
global i = $i + 1
putexcel (A$i) = "No. tests signif @ 0.01:"
putexcel (C$i) = sig_01_Jointpoststdp[1], names overwritefmt
putexcel (D$i) = sig_01_AEPpvalue[1], names overwritefmt
putexcel (E$i) = sig_01_Simespvalue[1], names overwritefmt
global i = $i + 1
putexcel (A$i) = "No. tests signif @ 0.05:"
putexcel (C$i) = sig_05_Jointpoststdp[1], names overwritefmt
putexcel (D$i) = sig_05_AEPpvalue[1], names overwritefmt
putexcel (E$i) = sig_05_Simespvalue[1], names overwritefmt
global i = $i + 1
putexcel (A$i) = "No. tests signif @ 0.10:"
putexcel (C$i) = sig_10_Jointpoststdp[1], names overwritefmt
putexcel (D$i) = sig_10_AEPpvalue[1], names overwritefmt
putexcel (E$i) = sig_10_Simespvalue[1], names overwritefmt
global i = $i + 1
putexcel (A$i) = "No. tests not significant:"
putexcel (C$i) = not_sig_Jointpoststdp[1], names overwritefmt
putexcel (D$i) = not_sig_AEPpvalue[1], names overwritefmt
putexcel (E$i) = not_sig_Simespvalue[1], names overwritefmt

putexcel save

************************************************** 
* Now, we perform meta-analysis of these p-values
* Wilson's harmonic mean p-value is undefined for p=0. And the Simes test,
* although technically defined for p=0, makes no sense with p=0, because a single
* p-value of zero will guarantee the minimum p-value is zero, even if every other
* p-value is 0.999. Therefore, we generate a set of non-zero joint postd std p-values.
* The non-zero p-values are generated by replacing zeroes with the smallest
* possible non-zero p-values, which are equal to 1/no. placebos
**************************************************
gen joint_post_std_p_nonzero = Jointpoststdp
replace joint_post_std_p_nonzero = 1/Noplacebos if joint_post_std_p_nonzero == 0

************************************************** 
* Using metap by Aurelio Tobias, we use these methods:
* Fisher's method, Edgington's additive method, and Edgington's normal curve method
* Note that Fisher's method is undefined for p=0, so we use p_val_nonzero
************************************************** 
/*
metap p_val_nonzero, method(f)
	matrix fisher_p = r(pvalue)
metap Jointpoststdp, method(ea)
	matrix edgington_additive_p = r(pvalue)
metap Jointpoststdp, method(en)
	matrix edgington_normal_p = r(pvalue)
	
* To save these results in the spreadsheet, We move two rows ahead, one to leave a blank and one to start entering data
putexcel set "normalization_results.xlsx", modify sheet(Normalization results)
global i = $i + 2	
*
putexcel (A$i) = "Fisher's meta-p:"
putexcel (C$i) = fisher_p[1,1], names overwritefmt
*
global i = $i + 1
putexcel (A$i) = "Edgington's additive meta-p:"
putexcel (C$i) = edgington_additive_p[1,1], names overwritefmt
*
global i = $i + 1
putexcel (A$i) = "Edgington's normal curve meta-p:"
putexcel (C$i) = edgington_normal_p[1,1], names overwritefmt
*/
	
************************************************** 
* Next, we use Wilson's harmonic mean p-value (HMP), which does not assume independence of p-values.
* This is important because our p-values from all these normalization tests are almost certainly correlated
* to an extremely high degree
*
* We have to use R to generate the harmonic mean p-value (HMP) and the asymptotically exact p-value,
* according to Wilson (2019), "The harmonic mean p-value for combining dependent tests". Proceedings of the National Academy of Sciences USA. 116 (4): 1195–1200. 
* Wilson wrote the packages in R, so we have to use R from within Stata
************************************************** 

/*
* Let's do some tests to be sure this is working. We're using
* R within Stata, so there's a lot to go wrong. 
* Wilson has provided a lookup table, so let's see if we can replicate those lookup table values, within Stata.
* If that works, we'll proceed to do our real work.
*
* According to the lookup table, an HMP of 0.040 with 10 tests implies alpha<0.05, so the following should return 0.05
* pharmonicmeanp takes an HMP and a number of tests (L) and returns an asymptotically-exact p-value
rcall: pharmonicmeanp(0.040, L=10, lower.tail=TRUE)

* According to the lookup table, an HMP of 0.0233 with 1 billion tests implies alpha<0.05, so the following should return 0.05
rcall: pharmonicmeanp(0.023, L=1000000000, lower.tail=TRUE)

* Let's manually put in 10 HMPs each equal to 0.040, unweighted, and calculate the exact p-value.
* We should get 0.05
* p.hmp takes a vector of ordinary p-values and returns an asymptotically exact p-value
rcall: x <- c(0.040, 0.040, 0.040, 0.040, 0.040, 0.040, 0.040, 0.040, 0.040, 0.040)
rcall: p.hmp(x)

* It seems to be working. So now we can do our real work.
*/

* So now we have to pass all our variables into R, so that we can calculate weighted HMP and exact p-values. The no_placebos will be used to generate weights
* For each of our p-values, we use rcall: x <- st.var(x) to pass the variable to R. But this creates a list, where the 0th column is the index, and the 
* and the 1st column is the actual p-value we want. So then we use rcall: x <- x[,1] to extract a vector from the list.
egen total_placebos = total(Noplacebos)
gen weights = Noplacebos/total_placebos
rcall: joint_post_std_p_nonzero <- st.var(joint_post_std_p_nonzero)
	rcall: joint_post_std_p_nonzero <- joint_post_std_p_nonzero[,1]
rcall: AEPpvalue <- st.var(AEPpvalue)
	rcall: AEPpvalue <- AEPpvalue[,1]
rcall: Simespvalue = st.var(Simespvalue)
	rcall: Simespvalue <- Simespvalue[,1]
rcall: weights <- st.var(weights)
	rcall: weights <- weights[,1]

* hmp.stat takes a vector of ordinary p-values and returns a harmonic mean p-value (HMP)
* rcall: hmp_unweighted <- hmp.stat(p_val_nonzero)

* hmp.stat can also take a vector of weights. We will use the number of placebos as a fraction of total placebos as the weights
* rcall: hmp_weighted <- hmp.stat(p_val_nonzero,weights)

* p.hmp takes a vector of ordinary p-values and returns an asymptotically exact p-value
* We have three sets of p-values, where each row or observation is a normalization year: we have the joint post std p-value, 
* the AEP p-value which combines all the individual years' std p-values, and the Simes p-value which combines all the individual
* years' p-values. We have each of these three p-values for every normalization year, so now we use Wilson's HMP/AEP to combine them.
* That is, each normalization year has a meta-p-value formed by AEP and Simes, and now, we meta-analyze the meta-p-values.
rcall: jointpoststdp_aep_unweighted <- p.hmp(joint_post_std_p_nonzero)
rcall: AEPpvalue_aep_unweighted <- p.hmp(AEPpvalue)
rcall: Simespvalue_aep_unweighted <- p.hmp(Simespvalue)

* p.hmp can also take a vector of weights. We will use the number of placebos as a fraction of total placebos as the weights
rcall: jointpoststdp_aep_weighted <- p.hmp(joint_post_std_p_nonzero,weights)
rcall: AEPpvalue_aep_weighted <- p.hmp(AEPpvalue,weights)
rcall: Simespvalue_aep_weighted <- p.hmp(Simespvalue,weights)

/* Daniel Wilson emailed me and told me that he has a forthcoming paper in PNAS
responding to a new criticism of the HMP. Apparently, under certain forms of dependence,
the HMP is not conservative enough, and a target alpha of 0.05 will be elevated to 0.07.
He suggested that I use a modified, multi-level form of Simes's test.

Wilson's multi-level test is specified in his email as follows:
calc.pSimes = function(p,w=rep(1/L,L),L=length(p)) {
  pstar = p/w
  min(sort(pstar)/(1:length(p)))
}

*/

* The Simes test requires sorting the p-values. It works by penalizing the smallest
* p-values by multiplying all p-values by N/i. If the minimum value of (Pi * N/i)
* is still less than alpha, then the overall test is significant.
* 
* Simes's test technically is defined for p-values of zero, 
* but by construction, the presence of even one p-value of zero will result in Simes's 
* test statistic being zero, no matter how large all the other p-values are. Therefore, 
* we will use the non-zero p-values.

rcall: L <- length(joint_post_std_p_nonzero)
rcall: w <- rep(1/L,L)
rcall: jointpoststd_simes <- min(sort(joint_post_std_p_nonzero/w)/(1:L))

rcall: L <- length(AEPpvalue)
rcall: w <- rep(1/L,L)
rcall: AEPpvalue_simes <- min(sort(AEPpvalue/w)/(1:L))

rcall: L <- length(Simespvalue)
rcall: w <- rep(1/L,L)
rcall: Simespvalue_simes <- min(sort(Simespvalue/w)/(1:L))

capture erase ".Rdata"
capture erase "_load.matrix.pvals_std.dta"

* Now all our R results are available in Stata as r() matrices. So we just output them to the spreadsheet.
* We move two rows ahead, one to leave a blank and one to start entering data
matrix jointpoststdp_aep_unweighted = r(jointpoststdp_aep_unweighted)
matrix AEPpvalue_aep_unweighted = r(AEPpvalue_aep_unweighted)
matrix Simespvalue_aep_unweighted = r(Simespvalue_aep_unweighted)
matrix jointpoststdp_aep_weighted = r(jointpoststdp_aep_weighted)
matrix AEPpvalue_aep_weighted = r(AEPpvalue_aep_weighted)
matrix Simespvalue_aep_weighted = r(Simespvalue_aep_weighted)
matrix jointpoststd_simes = r(jointpoststd_simes)
matrix AEPpvalue_simes = r(AEPpvalue_simes)
matrix Simespvalue_simes = r(Simespvalue_simes)
*
global i = $i + 2
putexcel (A$i) = "Asymptotically exact p-value, unweighted:"
putexcel (C$i) = jointpoststdp_aep_unweighted[1,1], names overwritefmt
putexcel (D$i) = AEPpvalue_aep_unweighted[1,1], names overwritefmt
putexcel (E$i) = Simespvalue_aep_unweighted[1,1], names overwritefmt
global i = $i + 1
putexcel (A$i) = "Asymptotically exact p-value, weighted:"
putexcel (C$i) = jointpoststdp_aep_weighted[1,1], names overwritefmt
putexcel (D$i) = AEPpvalue_aep_weighted[1,1], names overwritefmt
putexcel (E$i) = Simespvalue_aep_weighted[1,1], names overwritefmt
*
global i = $i + 1
putexcel (A$i) = "Simes's p-value:"
putexcel (C$i) = jointpoststd_simes[1,1], names overwritefmt
putexcel (D$i) = AEPpvalue_simes[1,1], names overwritefmt
putexcel (E$i) = Simespvalue_simes[1,1], names overwritefmt

putexcel save

************************************************** 	
* Baseline regression, except this time, we use synth_runner's "trends" option to normalize in 1916. 
*
* According to the help file, "trends will force synth to match on the trends in 
* the outcome variable. It does this by scaling each unit's outcome variable so 
* that it is 1 in the last pre-treatment period."
************************************************** 
cd "$path\results\normalize"
capture mkdir "Treatment 1917 - synth_runner trends"
cd "$path\results\normalize\Treatment 1917 - synth_runner trends"

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits
quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

capture noisily {
	synth_runner cite_English ///
	YearofPublication wrote_English wrote_German wrote_French wrote_Greek wrote_Latin wrote_Italian wrote_Spanish YearofTranslationtoEnglish Socialist Political ///
	cite_English(1914(1)1916) cite_English(1908(1)1910) cite_English(1902(1)1904) cite_English(1896(1)1898) cite_English(1890(1)1892) cite_English(1884(1)1886) cite_English(1878(1)1880), $convergence ///
	trunit($trunit) trperiod(1917) mspeperiod(1878(1)1916) /// 
	noredo_tr_error ///
	gen_vars ///
	keep("synth_runner_results") replace ///
	parallel trends
	
	save_synth_runner_output
	
	* Cleanup
	drop lead cite_English_synth effect pre_rmspe post_rmspe cite_English_scaled cite_English_scaled_synth effect_scaled
	parallel clean	
}

**************************************************************************************************************************************************************
* Baseline regression, except test whether BFGS gives the same result as DFP. (Assuming we have  been using DFP until now.)
*
* We start with $convergence_normalization instead of $convergence because $convergence_normalization  is set with a lower value of maxiter(#). 
* BFGS requires about 1.5 times as much RAM as DFP, and more iterations requires more RAM too (for some reason). So using BFGS implies we 
* should restrict to fewer iterations to avoid running out of RAM.  
*
* Thus, we take $convergence_normalization and replace the string "dfp" with "bfgs".
*
* Furthermore, we standardize our data to mean 0 & std dev 1, whereas until now, our data have been min-max normalized.
* Thus, we will be testing the robustness of our results to different numerical optimizers & conditions for those optimizers.
* Instead of DFP, min-max normalization, maxiter(20), we will have BFGS, standardization, maxiter(15). If we still get essentially the 
* same or a very similar result, then we know our results are probably not driven by the optimizer and/or the optimizer's conditions.
**************************************************************************************************************************************************************
global convergence = subinstr("$convergence_normalization","dfp","bfgs",1)

* Use snapshot 4, "Karl Marx, 1878-1932, English"
clear
snapshot restore $Karl_Marx_1878_1932_English
NumberUnits

foreach variable in cite_English cite_German cite_French YearofPublication YearofTranslationtoEnglish {
	capture quietly sum `variable'
	capture replace `variable' = (`variable' - r(mean)) / r(sd)
}

quietly sum Name_no if Name == "Karl Marx"
global trunit = r(mean)

* Create output folder if it does not already exist
cd "$path\results"
capture mkdir "Treatment 1917 - convergence_normalization except bfgs & standardized"
cd "Treatment 1917 - convergence_normalization except bfgs & standardized"
baseline_regression

**************************************************************************************************************************************************************
* Clear everything and restore any settings which were permanently changed.
**************************************************************************************************************************************************************	
clear 
rcall clear
capture erase "$path\.Rdata"
capture erase "$path\_load.matrix.pvals_std.dta"
parallel clean, all
set lapack_mkl $previous_lapack_setting, permanently
mata: mata set matafavor $previous_matafavor_setting, permanently
log close

